-
Notifications
You must be signed in to change notification settings - Fork 82
add device data in resourceclaim #128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
add device data in resourceclaim #128
Conversation
|
Welcome @guptaNswati! |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: guptaNswati The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
cc @nojnhuh for review. |
| return resultConfigs, nil | ||
| } | ||
|
|
||
| func (s *DeviceState) buildDeviceStatus(res *resourceapi.DeviceRequestAllocationResult) *resourceapply.AllocatedDeviceStatusApplyConfiguration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does res need to be a pointer, or can we save the extra syntax and theoretical nil-dereference by making this a plain value? Not a blocking issue, but I may propose refactoring this when I integrate with #129.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
|
|
||
| func (s *DeviceState) buildDeviceStatus(res *resourceapi.DeviceRequestAllocationResult) *resourceapply.AllocatedDeviceStatusApplyConfiguration { | ||
| dn := res.Device | ||
| deviceInfo := make(map[string]interface{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could we make this more strongly typed?
| deviceInfo := make(map[string]interface{}) | |
| deviceInfo := make(map[string]resourceapi.DeviceAttribute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| WithDriver(res.Driver). | ||
| WithPool(res.Pool). | ||
| WithConditions(cond). | ||
| WithData(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own understanding, who or what generally consumes this data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, this seems useful for monitoring and debugging when admin want to know pod to gpu mapping, which GPU is being used by a particular pod. right now this info is not readily available.
| } | ||
| cond := metav1apply.Condition(). | ||
| WithType("Ready"). | ||
| WithStatus(metav1.ConditionTrue). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the NVIDIA driver also only set this condition at the time the claim is allocated, or can it ever change after that? If it can change, I think the example driver should model how to update that. Even if it stays constant, scaffolding out how that can work with something that says "this is where the latest status is determined" would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so for nvidia-dra-driver i was looking to use this as part of GPU health check status update. But we dint really see any value of showing this information as part of an allocated claim as it does not really change. Its more complex to show any real updates on an allocated claim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I think for setting device conditions, the interesting part isn't the API call to update the condition, but actually calculating that value of the condition asynchronously. I think I'd rather include that (or no condition at all) than a hardcoded condition added during NodePrepareResources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay since we don't have device health checking rn, i can just change it to just allocated for example purposes with a comment or remove it.
// TODO: This condition currently reflects only that the device was allocated, not that
// it is healthy or ready for use. True readiness should be computed asynchronously by
// the health-monitoring pipeline (e.g., NVML Xid/error tracking). In the future, this
// should be replaced—or augmented—with a condition whose value is driven by actual
// device health state rather than allocation alone.
cond := metav1apply.Condition().
WithType("Allocated").
WithStatus(metav1.ConditionTrue).
WithReason("GPUDeviceAllocated").
WithMessage("GPU device allocated for this request").
WithLastTransitionTime(metav1.Now())
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still prefer to either omit this condition entirely or create all of the scaffolding around driving the condition asynchronously. I don't think we should be implementing patterns that we don't recommend, even if we document the right way.
If we decide to omit the condition here, let's open an issue to track adding async conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay sounds good. than i will remove it for now as it requires more work and i will open an issue and work on it.
|
|
||
| opts := metav1.ApplyOptions{FieldManager: consts.DriverName, Force: true} | ||
|
|
||
| _, err := s.config.coreclient.ResourceV1().ResourceClaims(ns).ApplyStatus(ctx, claim, opts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some lightweight verification we can add to the e2e tests? At least something to verify that something like the condition or one of the attributes is set on even one of the examples would make sure this doesn't totally break later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
| if d.Attributes != nil { | ||
| attributes := d.Attributes | ||
|
|
||
| if uuid, ok := attributes["uuid"]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indexing into a nil map doesn't panic, so we can skip this check. I at least don't see any issues skipping it if I remove all of the attributes from the devices and allocate them.
| if d.Attributes != nil { | |
| attributes := d.Attributes | |
| if uuid, ok := attributes["uuid"]; ok { | |
| if uuid, ok := d.Attributes["uuid"]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| attributes := d.Attributes | ||
|
|
||
| if uuid, ok := attributes["uuid"]; ok { | ||
| deviceInfo["uuid"] = uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking another look at this, I'm not sure I see the value in propagating attributes that already exist in the ResourceSlice. Is there a common use case where deriving these attributes from the ResourceSlice doesn't work?
Or is there some other data we can use here? Maybe like the time-slicing/space-partitioning config? I could see that being useful to model the opaque config being like a request, then the device status would represent the response in cases where a given config might result in several different outcomes. Kind of like the regular spec/status model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. i need to see whats more valuable to add here. My thought was to have this as an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I suppose attributes here could be useful in case the device is removed from a ResourceSlice because it became unhealthy or otherwise. Let's keep them here, but can we include a comment explaining when recording the attributes in the status is useful?
| } | ||
| cond := metav1apply.Condition(). | ||
| WithType("Ready"). | ||
| WithStatus(metav1.ConditionTrue). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I think for setting device conditions, the interesting part isn't the API call to update the condition, but actually calculating that value of the condition asynchronously. I think I'd rather include that (or no condition at all) than a hardcoded condition added during NodePrepareResources.
|
FYI @guptaNswati I was hoping to get this PR merged before #129 to keep you from having to deal with those changes here. There are other changes in the works that are based on that PR though, so I may elect to merge that one first. If #129 merges first, I'm happy to help resolve conflicts with your PR here! |
Oh i see. thank you for the headsup. i will see if i can update this tomorrow or else you can go ahead and merge #129. Main blocker is the e2e, right? we can't merge this without the tests? |
|
@guptaNswati Yes, I would like to include something in the tests for this. #128 (comment) is the other comment I would like to resolve before merging. |
|
@guptaNswati Have you pushed your recent changes? The most recent commit I see here is from when the PR was first opened. |
sorry pushing them rn. |
|
this is the quick test with updated changes. fixing e2e test |
|
e2e test |
c98e061 to
a03108c
Compare
Signed-off-by: Swati Gupta <[email protected]> address review comment: fix pointer ref Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
a03108c to
b77e105
Compare
Addressing #101 to add device attributes to demonstrate how to do resourceclaim status update.
Test run